Ramiro Lucero - PostGIS & Tableau

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Ramiro Lucero, University of Buenos Aires, ramiroalucero@yahoo.com.ar

Tool(s):

The tools used for this challenge were a PostgreSQL 8.4 database with the PostGIS 1.4 module, gvSIG 1.11, and Tableau Desktop 6.0.

PostgreSQL and PostGIS were used to handle the data provided and to perform spatial analysis on it. gvSIG is a desktop GIS application; it was used first to geo-reference the Vastopolis image, and then to digitize the main features of this image into ESRI shapefiles. The generated shapefiles were also loaded into PostgreSQL. Tableau Desktop was used for visualizing and analyzing the data.

Postgres is an open source object-relational database system and can be found at http://www.postgresql.org/.

gvSIG is an open source GIS written in Java. The gvSIG project webpage is http://www.gvsig.org.

Tableau Desktop is commercial information visualization software. Tableau Software provided a demo license to all the students taking the Information Visualization course at the University of Buenos Aires. More information about this software can be found at http://www.tableausoftware.com.

 

Video:

 

Here is the link to the explanatory video. 

 

ANSWERS:


MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

The outbreak starts in the Downtown area of Vastopolis on the 18th of May at about 8 AM.

 

Figure 1.1.1

 

Figure 1.1.1 shows that there are two outbreaks. The first one is mainly in the Downtown area (and is less intense in the Uptown and Eastside areas), and the second one is in the Plainville and Westside areas.

 

Plot of the messages containing acute symptoms for the first 8 hours after each outbreak:

 

Figure 1.1.2

 

The origin of each outbreak would be inside the green circle. The red line denotes the outline of the most affected areas. The first outbreak seems to be transmitted by air, and the wind on the 18th was coming from the west, so its origin should be on the west side of the white trend lines. The second outbreak is apparently transmitted by water through the Vast River, so its origin should be upstream of where people got sick.


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

The diagram in Figure 1.2.1 shows the different stages in the process of solving this Mini-Challenge.

Figure 1.2.1

First, all the tables provided were loaded into the PostgreSQL database. The fields were correctly formatted and the links between tables were generated. Afterwards, the Vastopolis image was geo-referenced and re-projected to the UTM Zone 15 North projection, in order to work in metric coordinates instead of degrees. All main features from the image were digitized using gvSIG. These features were then uploaded to PostgreSQL. All the coordinates of the messages (microblogs) provided were re-projected to UTM Zone 15 North as well (End of step (1) - Figure 1.2.1).
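The re-projection step can be illustrated with a short sketch. The text does not say which function performed the conversion, so the code below is only an assumption-laden stand-in: a pure-Python forward transverse Mercator projection (the standard series expansion) on the WGS84 ellipsoid, with the UTM zone 15 central meridian at 93 degrees west. The function name `lonlat_to_utm15n` and the choice of datum are hypothetical.

```python
import math

# WGS84 ellipsoid and UTM constants. The datum is an assumption; the text
# only names the projection (UTM Zone 15 North).
A_AXIS = 6378137.0
F = 1 / 298.257223563
E2 = F * (2 - F)             # first eccentricity squared
EP2 = E2 / (1 - E2)          # second eccentricity squared
K0 = 0.9996                  # UTM scale factor on the central meridian
LON0 = math.radians(-93.0)   # central meridian of zone 15: 15 * 6 - 183


def lonlat_to_utm15n(lon_deg, lat_deg):
    """Return (easting, northing) in metres for a lon/lat point in degrees."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    n = A_AXIS / math.sqrt(1 - E2 * math.sin(lat) ** 2)
    t = math.tan(lat) ** 2
    c = EP2 * math.cos(lat) ** 2
    a = (lon - LON0) * math.cos(lat)
    # Meridian arc length from the equator to the given latitude.
    m = A_AXIS * (
        (1 - E2 / 4 - 3 * E2**2 / 64 - 5 * E2**3 / 256) * lat
        - (3 * E2 / 8 + 3 * E2**2 / 32 + 45 * E2**3 / 1024) * math.sin(2 * lat)
        + (15 * E2**2 / 256 + 45 * E2**3 / 1024) * math.sin(4 * lat)
        - (35 * E2**3 / 3072) * math.sin(6 * lat)
    )
    easting = 500000.0 + K0 * n * (
        a
        + (1 - t + c) * a**3 / 6
        + (5 - 18 * t + t**2 + 72 * c - 58 * EP2) * a**5 / 120
    )
    northing = K0 * (
        m
        + n * math.tan(lat) * (
            a**2 / 2
            + (5 - t + 9 * c + 4 * c**2) * a**4 / 24
            + (61 - 58 * t + t**2 + 600 * c - 330 * EP2) * a**6 / 720
        )
    )
    return easting, northing
```

Working in metres rather than degrees means that distances between messages and map features can be computed with ordinary planar geometry.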

Second, I loaded the data into Tableau and started making some visualizations and learning how to use this software. The first visualization plotted the coordinates of the messages over the map, showing only the messages that contained the word flu and showing one day at a time. This did not work well for answering the challenge questions, but it helped me realize that something happened on the 19th of May, and that on the 20th of May many people were concentrated at the Vastopolis hospitals. I noticed that there were many false positives: messages from people who were not ill but were flagged as having flu. I needed a better way to separate the messages of people who were ill from those who weren't. I also needed to detect messages from people who were just starting to exhibit symptoms, rather than people already diagnosed with flu, in order to get closer to finding when and where the outbreak started (End of step (2) - Figure 1.2.1).

In order to separate the messages of people who were starting to have symptoms from the rest, I counted how many words related to flu symptoms (flu, fever, chills, sweats, aches, pain, fatigue, cough, breath, nausea, vomit and diarrhea) were present in each message. For this I used PostgreSQL tsvector and tsquery text search. I also counted the words death, ill, sick and killingme. Using PostGIS functions I assigned to each message the zone from which it was sent (Downtown, Eastside, etc.) and also calculated its distance to the main features digitized from the Vastopolis image (hospitals, roads, etc.).

I made two categories for the words that were counted: Symptoms and Symptoms_acute. Symptoms grouped all the words counted, and Symptoms_acute only the following: fever, chills, sweats, aches, pain, fatigue, cough, breath, nausea, vomit, diarrhea and killingme. I made this distinction because some of the words counted were also used in texts that were not about illness; the words in Symptoms_acute are more reliable in that sense. I assigned to each message the number of words counted from each category.

Afterwards, for each ID I generated the variable symp_add, with the total number of words sent from the Symptoms category, and the variable symp_acute_add, with the total number of words sent from the Symptoms_acute category; this was done over all the messages in the database. Based on the distributions of these two variables I created the following variable for each ID:
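As a rough illustration of the per-message counting, here is a minimal Python sketch. It is not the tsvector/tsquery query used in the analysis: matching is done on exact lowercased tokens (PostgreSQL text search would also match stemmed forms), and each keyword is counted once per message. Both choices, and the function name `count_symptom_words`, are assumptions.

```python
import re

# Word lists taken from the text. Exact-token matching is an assumption;
# PostgreSQL text search would also match stemmed word forms.
SYMPTOMS_ACUTE = {
    "fever", "chills", "sweats", "aches", "pain", "fatigue",
    "cough", "breath", "nausea", "vomit", "diarrhea", "killingme",
}
SYMPTOMS = SYMPTOMS_ACUTE | {"flu", "death", "ill", "sick"}


def count_symptom_words(text):
    """Return (symptoms, symptoms_acute): distinct keywords found in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & SYMPTOMS), len(tokens & SYMPTOMS_ACUTE)
```

For example, a message mentioning fever, chills and sick scores three Symptoms words and two Symptoms_acute words, while a message that only mentions a flu shot scores one and zero.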

had_flu_high = 1, when symp_add > 2 or symp_acute_add > 1
had_flu_high = 0, otherwise

This new variable helps me decide which IDs are more likely to have really caught the flu.

The last variables generated for each ID were the date and time of the first message sent with a Symptoms word.
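The rule for had_flu_high can be sketched in a few lines of Python. The input layout (one tuple per message, holding the ID and the two per-message word counts) is an assumption made for the example; the thresholds are the ones stated in the text.

```python
from collections import defaultdict


def had_flu_high(messages):
    """Compute had_flu_high per ID.

    messages: iterable of (user_id, symptoms, symptoms_acute), one per message.
    Returns {user_id: 1 or 0}.
    """
    symp_add = defaultdict(int)
    symp_acute_add = defaultdict(int)
    for user_id, symptoms, symptoms_acute in messages:
        symp_add[user_id] += symptoms
        symp_acute_add[user_id] += symptoms_acute
    # Thresholds as stated in the text: symp_add > 2 or symp_acute_add > 1.
    return {
        uid: int(symp_add[uid] > 2 or symp_acute_add[uid] > 1)
        for uid in symp_add
    }
```

An ID that sends three messages with one Symptoms word each gets had_flu_high = 1, as does an ID with a single message containing two Symptoms_acute words.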

After creating these new variables I started generating visualizations with Tableau again. Finally, I arrived at the following plot, which made me understand what was happening in Vastopolis.

Figure 1.2.2

Figure 1.2.2 is essentially the same as Figure 1.1.1. I arrived at this plot by dynamically filtering the IDs, keeping only those with had_flu_high=1 whose first message containing Symptoms words included two or more Symptoms_acute words. People who meet both conditions have an even higher probability of really being ill. This filter significantly reduces the number of IDs, but gives high confidence that the selected IDs are ill. Once the ill people are detected, it is easier to see the pattern of the disease (End of steps (3),(4),(5) - Figure 1.2.1).

Figure 1.2.2 shows an outbreak on the 18th of May at 8 AM, mainly in the Downtown area and with less intensity in the Uptown and Eastside areas. The number of people affected starts to drop abruptly on the 18th at 8 PM, but on the 19th at about 2 AM a new outbreak starts in the Plainville and Westside areas. From then on, the rate of infections remains quite stable until the last day registered in the dataset. This indicates that the first outbreak is controlled, but the second one isn't.
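The filter described here was applied interactively in Tableau; the following Python sketch only restates its logic under assumed inputs. The tuple layout, the `had_flu_high` dictionary argument and the function name `select_likely_ill` are all hypothetical.

```python
from datetime import datetime


def select_likely_ill(messages, had_flu_high):
    """Return the set of IDs that pass both conditions of the filter.

    messages: iterable of (user_id, timestamp, symptoms, symptoms_acute).
    had_flu_high: {user_id: 1 or 0}.
    """
    first = {}  # user_id -> (timestamp, symptoms_acute) of first symptom message
    for user_id, ts, symptoms, acute in messages:
        if symptoms == 0:
            continue  # message has no Symptoms words
        if user_id not in first or ts < first[user_id][0]:
            first[user_id] = (ts, acute)
    return {
        uid
        for uid, (_, acute) in first.items()
        if had_flu_high.get(uid) == 1 and acute >= 2
    }
```

The first-message timestamps kept in `first` are the same per-ID date and time values that drive the timeline, so the selected IDs can be plotted at the moment their symptoms were first reported.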

Figure 1.2.3: May 18th outbreak. Messages from 8AM to 6PM.

 

Figure 1.2.3 shows the plume of messages from people feeling ill. The wind is coming from the west, and the plume extends downwind, which suggests that the disease is spreading through the air.

Figure 1.2.4: May 19th outbreak. Messages from 2AM to 10AM.

Figure 1.2.4 shows that during the second outbreak, messages from ill people are concentrated along the Vast River. The symptoms in this outbreak are different from those in the first: they are more related to gastric problems (vomit, diarrhea and nausea), while in the first outbreak the symptoms were more related to fever, chills, sweats and respiratory problems. This evidence suggests that the second outbreak is transmitted by water.
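The difference in symptom profiles between the two outbreaks could be checked per message with something like the sketch below. The gastric word group comes from the text; treating cough and breath as the "respiratory problems" words is an assumption, as are the function name and the label strings.

```python
import re

GASTRIC = {"vomit", "diarrhea", "nausea"}
# "cough" and "breath" standing in for "respiratory problems" is an assumption.
FEVER_RESPIRATORY = {"fever", "chills", "sweats", "cough", "breath"}


def symptom_profile(text):
    """Label a message by which symptom group it mentions more."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    gastric = len(tokens & GASTRIC)
    fever = len(tokens & FEVER_RESPIRATORY)
    if gastric > fever:
        return "gastric"
    if fever > gastric:
        return "fever_respiratory"
    return "mixed" if gastric else "none"
```

Counting these labels separately for each outbreak window would give a simple numeric check of the contrast described above.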

Figure 1.2.5: May 20th Messages

As we can see from Figure 1.2.5, on the last day the outbreak in the Downtown area is controlled, while the one along the Vast River is still active. More people are getting sick along the river with gastric symptoms. The river flows beyond the edge of the image, so, assuming that the disease is transmitted by water, it would be necessary for emergency management personnel to deploy treatment resources outside the affected area.